Abstract - Complex networks have been employed to model many real systems and as a modeling
tool in a myriad of applications. In this paper, we apply the framework of complex networks to the
problem of supervised classification in the word disambiguation task, which consists in deriving a
function from the supervised (or labeled) training data of ambiguous words. Traditional supervised
data classification takes into account only topological or physical features of the input data. On
the other hand, the human (animal) brain performs both low and high level orders of learning
and readily identifies patterns according to the semantic meaning of the input data. In
this paper, we apply a hybrid technique which encompasses both types of learning in the field
of word sense disambiguation and show that the high level order of learning can improve
the accuracy rate of the model. This evidence serves to demonstrate that the internal structures
formed by the words do present patterns that, generally, cannot be correctly unveiled by
traditional techniques alone. Finally, we exhibit the behavior of the model for different weights of the
low and high level classifiers by plotting decision boundaries. This study helps one to better
understand the effectiveness of the model.



Introduction. — Language is present everywhere and pervades all aspects of our daily life since
the dawn of humanity. Although it has been largely studied, several issues remain open, such as
the explanation of the emergence of fundamental laws such as Zipf's Law [1]. Currently, language
is not studied exclusively by linguists or psychologists. Physicists have borrowed some of
their tools to study emergent linguistic patterns. For example, complex systems [2], which are
characterized by agents interacting in a non-trivial way, have been used to model interactions
between words or segments of a text [1,3-5]. In the last few years, complex networks (CN)
have been used to study both theoretical and practical aspects of language. Examples of recent
theoretical findings using such a robust model include the verification of universal properties [3]
and the modeling of adjacency networks. From the practical perspective, complex networks have
been used to summarize texts [6], to assess the quality of machine translators [7], to group and
classify data [8,9], and others.
In the current paper, we assess the ability of complex networks to address the Word Sense
Disambiguation (WSD) task (i.e., the discrimination of which of the meanings is used
in a given context for a word that has multiple meanings). The importance of the WSD task
stems from its essential role in the development of the so-called Semantic Web. The WSD task
is also essential for machine translation research [7]. Although a myriad of strategies have
been developed so far, none of them has evaluated the usefulness of complex networks both to
model texts and to recognize patterns arising from the topological and semantic relationships
among senses. For this reason, we apply a novel generalized methodology based on the concept of
complex networks [10] in the field of WSD. First, networks were employed to model the
relationship between words in written texts, from which it was possible to characterize
both the semantic and topological properties of words inserted in a given semantic context
(see Section 1.2 of the Supplementary Information¹ (SI)). Then, the similarity relationship
given by such a characterization was modeled in the form of networks in order to extract and
exploit patterns among the data in the networked representation. Interestingly, assuming that
the descriptions of senses in the resulting space are not isolated points but instead tend to
form certain patterns, we found that it is possible to improve the discrimination relative to
the performance achieved with traditional classifiers.

Overview of the Technique. — In this section, we review the hybrid high level technique [10].
Consider a training set $X_{training} = \{(x_1, y_1), \ldots, (x_l, y_l)\}$, where the first
component of the $i$th tuple, $x_i = (f_1, \ldots, f_d)$, denotes the attributes of the
$d$-dimensional $i$th training instance. The second component $y_i \in \mathcal{L} = \{L_1, \ldots, L_n\}$
characterizes the class label or target associated with that training instance. The goal here
is to learn a mapping $x \mapsto y$. Usually, the constructed classifier is checked by using a
test set $X_{test} = \{x_{l+1}, \ldots\}$ in which labels are not provided. In this case, each
data item is called a test instance.

In the supervised learning scheme, there are two phases 
of learning: the training phase and the classification phase. 
In the training phase, the classifier is induced or trained
by using the training instances (labeled data) in $X_{training}$.
In the classification phase, the labels of the test instances
in $X_{test}$ are predicted using the induced classifier. Below,
these two phases are presented in detail. 

In the training phase, the data in the training set are mapped into a graph $G$ using a
network formation technique $g : X_{training} \mapsto G = (\mathcal{V}, \mathcal{E})$, where
$\mathcal{V} = \{1, \ldots, V\}$ is the set of vertices and $\mathcal{E}$ is the set of edges.
Each vertex in $\mathcal{V}$ represents a training instance in $X_{training}$. As will be
described later, the pattern formation of the classes will be extracted by using the complex
topological features of this networked representation.

¹ The Supplementary Information (SI) is hosted at http://dl.dropbox.com/u/2740286/epl_SI_9apr.pdf
Fig. 1: (a) Schematic of the network in the training phase; (b) schematic of how the
classification inference is done.

The edges in $\mathcal{E}$ are created using a combination of the $\epsilon_r$ and
$k$-nearest neighbors ($k$NN) graph formation techniques. In the original versions, the
$\epsilon_r$ technique creates a link between two vertices if they are within a distance
$\epsilon$, while the $k$NN sets up a link between vertices $i$ and $j$ if $i$ is one of the
$k$ nearest neighbors of $j$ or vice versa. Both approaches have their limitations when
sparsity or density is a concern. For sparse regions, the $k$NN forces a vertex to connect to
its $k$ nearest vertices, even if they are far apart. In this scenario, the neighborhood of
this vertex would contain dissimilar points. Likewise, improper $\epsilon$ values could result
in disconnected components, sub-graphs, or isolated singleton vertices.

The network is constructed using these two traditional graph formation techniques in a
combined form. The neighborhood of a vertex $x_i$ is given by
$N(x_i) = \epsilon_r(x_i, y_{x_i})$ if $|\epsilon_r(x_i, y_{x_i})| > k$. Otherwise,
$N(x_i) = \kappa(x_i, y_{x_i})$, where $y_{x_i}$ denotes the class label of the training
instance $x_i$, $\epsilon_r(x_i, y_{x_i})$ returns the set
$\{x_j, j \in \mathcal{V} : d(x_i, x_j) < \epsilon \wedge y_{x_i} = y_{x_j}\}$, and
$\kappa(x_i, y_{x_i})$ returns the set containing the $k$ nearest vertices of the same class as
$x_i$. Note that the $\epsilon_r$ technique is used for dense regions
($|\epsilon_r(x_i)| > k$), while the $k$NN is employed for sparse regions. With this mechanism,
it is expected that each class will have a unique and single graph component.
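
The combined neighborhood rule can be illustrated with a short sketch. It is not the authors' implementation: the function name `build_training_graph`, the use of the Euclidean distance for $d$, and the adjacency-dictionary representation are assumptions made only for illustration.

```python
import math

def build_training_graph(points, labels, k, eps):
    """Return an undirected adjacency dict {i: set(j)} built with the
    combined eps-radius / kNN rule (hypothetical helper, Euclidean d assumed)."""
    n = len(points)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        # Same-class candidates of vertex i, sorted by distance to i.
        same = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(n)
            if j != i and labels[j] == labels[i]
        )
        within = [j for d, j in same if d < eps]
        # Dense region: keep the whole eps-ball. Sparse region: fall back to kNN.
        neighbors = within if len(within) > k else [j for _, j in same[:k]]
        for j in neighbors:
            adj[i].add(j)
            adj[j].add(i)  # edges are undirected
    return adj

pts = [(0, 0), (0.01, 0), (0.02, 0), (1, 1), (5, 5), (5.1, 5)]
labs = [0, 0, 0, 0, 1, 1]
graph = build_training_graph(pts, labs, k=1, eps=0.05)
```

In this toy input, the three clustered points of class 0 are linked through the radius rule, while the isolated point `(1, 1)` falls back to its single nearest same-class neighbor, so that no vertex is left as a singleton and no edge crosses classes.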

For the sake of clarity, Fig. 1a shows a schematic of what the network looks like for a
three-class problem when the training phase has been completed. In this case, each
class holds a representative component. In the figure, the surrounding circles denote these
components: $\mathcal{G}_{C_1}$, $\mathcal{G}_{C_2}$, and $\mathcal{G}_{C_3}$.

In the classification phase, the unlabeled data items in $X_{test}$ are presented to the
classifier one by one. In contrast to the training phase, the class labels of the test
instances are unknown. For this reason, each test instance is inserted into the network using
only the traditional $\epsilon_r$ technique, meaning it is connected to every vertex within
this radius, no matter to which class each vertex in this region belongs. Once the data item
is inserted, each class analyzes, in isolation, its impact on the respective class component
using the complex topological features of that component. In the high level model, each class
retains an isolated graph component. Each of these components calculates the changes






Word Sense Disambiguation Via High Order of Learning in Complex Networks 



that occur in its pattern formation with the insertion of 
this test instance. If slight or no changes occur, then it 
is said that the test instance is in compliance with that 
class pattern. As a result, the high level classifier yields a large membership value for
that test instance on that class. Conversely, if these changes dramatically modify
the class pattern, then the high level classifier produces a small membership value on that
class. These changes are quantified via network measures, each of which numerically captures
the organization of the component at a scale ranging from local to global. As we will see,
the average degree, clustering coefficient, and assortativity measures
are employed for the high level order of learning.

For the sake of clarity, Fig. 1b exhibits a schematic of how the classification process is
performed. The test instance (triangle-shaped) is inserted using the traditional $\epsilon_r$
technique. Due to its insertion, the class components become altered: $\mathcal{G}_{C_1}$,
$\mathcal{G}_{C_2}$, and $\mathcal{G}_{C_3}$, where each of them is a component surrounded by a
circle in Fig. 1b. It may occur that some class components do not share any links with this
test instance. In the figure, this happens with $\mathcal{G}_{C_3}$. In this case, we say that
the test instance does not comply with the pattern formation of the class component. For the
components that share at least one link ($\mathcal{G}_{C_1}$ and $\mathcal{G}_{C_2}$), each of
them calculates, in isolation, the impact on its pattern formation by virtue of the insertion
of the test instance. For example, when we check the compliance of the test instance with the
component $\mathcal{G}_{C_1}$, the connections from the test instance to the component
$\mathcal{G}_{C_2}$ are ignored, and vice versa.

Concurrently to the prediction made by the high level 
classifier, a low level classifier also predicts the member- 
ship of the test instance for every class in the problem. 
The way it predicts depends on the choice of the low level 
classifier. In the end, the predictions produced by both 
classifiers are combined via a linear combination to derive
the prediction of the high level framework (meta-learning).
Once the test instance is classified, it is either discarded
or incorporated into the training set with the corresponding
predicted label. In the second case, the classifier must be
retrained. Note that, in either situation, each
class is still represented by a single graph component. 

The High Level Classification. — The hybrid classifier $M$ consists of a convex combination
of two terms: (i) a low level classifier (C4.5, kNN, or Naive Bayes [11])² and (ii) a high
level classifier, which is responsible for classifying a test instance according to its
pattern formation with the data. Mathematically, the membership of the test instance
$x_i \in X_{test}$ with respect to the class $j \in \mathcal{L}$, here written as
$M_i^{(j)}$, is given by:

$$M_i^{(j)} = (1 - \lambda)\, T_i^{(j)} + \lambda\, C_i^{(j)}, \tag{1}$$

where $T_i^{(j)} \in [0, 1]$ denotes the membership of the test instance $x_i$ on class $j$
produced by an arbitrary traditional (low level) classifier; $C_i^{(j)} \in [0, 1]$ indicates
the same membership information yielded by a high level classifier; and $\lambda \in [0, 1]$
is the compliance term, which plays the role of counterbalancing the classification decisions
supplied by the low and high level classifiers. Whenever $T_i^{(j)} = 1$ and $C_i^{(j)} = 1$,
we may deduce that the $i$th data item carries all the characteristics of class $j$. On the
other hand, whenever $T_i^{(j)} = 0$ and $C_i^{(j)} = 0$, we may infer that the $i$th data item
neither presents any similarity to nor complies with the pattern formation of class $j$.
Values in between these two extremes lead to natural uncertainty in the classification process
and are found most of the time during a classification task. Note that Eq. (1) generates fuzzy
outputs. Moreover, when $\lambda = 0$, Eq. (1) reduces to a common low level classifier. A test
instance receives the label of the class $j$ that maximizes (1).
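
As a minimal sketch of this decision rule, assuming the convex combination $M = (1 - \lambda) T + \lambda C$ described above, the following snippet combines per-class memberships and picks the maximizing class. The function names and the membership values are hypothetical and chosen only to show how the compliance term can flip a decision.

```python
def hybrid_membership(low, high, lam):
    """Convex combination of low level (T) and high level (C) memberships.
    `low` and `high` map class label -> membership in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("the compliance term must lie in [0, 1]")
    return {c: (1.0 - lam) * low[c] + lam * high[c] for c in low}

def predict(low, high, lam):
    """Return the class that maximizes the hybrid membership."""
    m = hybrid_membership(low, high, lam)
    return max(m, key=m.get)

# Illustrative values only: the low level classifier favors "square",
# while the high level classifier favors "star".
low = {"star": 0.3, "square": 0.7}
high = {"star": 0.9, "square": 0.1}

label_low_only = predict(low, high, lam=0.0)  # pure low level decision
label_hybrid = predict(low, high, lam=0.5)    # equal weighting
```

With these numbers the label changes from "square" at `lam=0.0` to "star" at `lam=0.5`, since the hybrid memberships become 0.6 and 0.4, respectively.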

The inference of pattern formation within the data, which is used by the classifier $C$, is
processed using the generated network. The motivation behind using networks is that they can
describe topological structures among the data items. These networks are constructed such
that: (i) each class is an isolated subgraph (component) and (ii) after the insertion of a new
test instance, each class must still retain a representative and unique component. With that
in mind, the pattern formation of the data is quantified through a combination of network
measures developed in the complex network literature. These measures are chosen so as to cover
relevant high level aspects of the class component. Suppose that $K$ measures are selected
to comprise the high level classifier $C$. Mathematically, the membership of the test instance
$x_i \in X_{test}$ with respect to the class $j \in \mathcal{L}$ yielded by the high level
classifier, here written as $C_i^{(j)}$, is given by:

$$C_i^{(j)} = \frac{\sum_{u=1}^{K} \alpha(u) \left[1 - f_i^{(j)}(u)\right]}{\sum_{g \in \mathcal{L}} \sum_{u=1}^{K} \alpha(u) \left[1 - f_i^{(g)}(u)\right]}, \tag{2}$$

where $\alpha(u) \in [0, 1]$, $\forall u \in \{1, \ldots, K\}$, $\sum_{u=1}^{K} \alpha(u) = 1$,
are user-controllable coefficients that indicate the influence of each network measure in the
classification process and $f_i^{(j)}(u)$ is a function that depends on the $u$th network
measure applied to the $i$th data item with regard to the class $j$. This function is
responsible for indicating whether the test instance $x_i$ presents the same patterns as the
class $j$ or not. The denominator in (2) has been introduced solely for normalization.
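
The normalization step can be sketched as follows. The snippet assumes that the high level membership is the $\alpha$-weighted sum of $1 - f_i^{(j)}(u)$ over the $K$ measures, divided by the same quantity summed over all classes; this form is inferred from the surrounding description rather than taken verbatim. The function name, the input layout, and the handling of a zero denominator are illustrative choices.

```python
def high_level_membership(f, alpha):
    """f[c] is the list [f(1), ..., f(K)] for class c, each value in [0, 1];
    alpha is the list of K weights, which must sum to 1."""
    if abs(sum(alpha) - 1.0) > 1e-9:
        raise ValueError("alpha must sum to 1")
    # Small variations (f close to 0) give a large unnormalized score.
    score = {
        c: sum(a * (1.0 - fu) for a, fu in zip(alpha, fc))
        for c, fc in f.items()
    }
    total = sum(score.values())
    if total == 0.0:
        # Assumption: no class complies at all, so every membership is 0.
        return {c: 0.0 for c in score}
    return {c: s / total for c, s in score.items()}

# Hypothetical values of f for K = 3 measures and two classes.
f_values = {"A": [0.1, 0.2, 0.0], "B": [0.6, 0.5, 0.7]}
weights = [0.5, 0.25, 0.25]
membership = high_level_membership(f_values, weights)
```

Here class "A" undergoes smaller variations than class "B", so it receives the larger share of the membership (0.9 / 1.3 against 0.4 / 1.3), and the memberships sum to one by construction.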

With respect to $f_i^{(j)}(u)$, it possesses a general closed form given by:

$$f_i^{(j)}(u) = \Delta G_i^{(j)}(u)\, p^{(j)}, \tag{3}$$


² A brief description of the low level classifiers is given in the SI.



where $\Delta G_i^{(j)}(u) \in [0, 1]$ is the variation of the $u$th network
measure that occurs on the component representing
class $j$ if $x_i$ joins it and $p^{(j)} \in [0, 1]$ is the proportion of
data items pertaining to the class $j$. Remembering that






T. C. Silva and D. R. Amancio 



each class has a component representing itself, the strat- 
egy to check the pattern compliance of a test instance is to 
examine whether its insertion causes a great variation of 
the network measures representing the class component. 
In other words, if there is a small change in the network 
measures, the test instance is in compliance with all the 
other data items that comprise that class component, i.e., 
it follows the same pattern as the original members of that 
class. On the other hand, if its insertion is responsible for
a significant variation of the component's network measures,
then the test instance probably does not belong to
that class.

We proceed to explain the role of $p^{(j)} \in [0, 1]$ in (3). In real-world databases,
unbalanced classes are usually encountered: a database frequently encompasses several classes
of different sizes. A large portion of the network measures are very sensitive to the size of
the components. In an attempt to soften this problem and cancel out the effects of distinct
component sizes, (3) introduces the term $p^{(j)}$, which is the proportion of vertices that
class $j$ has.
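
A small sketch may help to see how the class proportion enters the computation of $f_i^{(j)}(u) = \Delta G_i^{(j)}(u)\, p^{(j)}$. The text does not specify how the variation is scaled to $[0, 1]$, so the normalization used in `normalized_variation` (absolute change divided by the sum of absolute changes over all classes) is an assumption, as are the function names and the example measure values.

```python
from collections import Counter

def class_proportions(labels):
    """p^(j): fraction of training vertices that belong to each class."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in counts}

def normalized_variation(before, after):
    """Variation of one network measure per class, scaled to [0, 1].
    Assumption: absolute change normalized by the total change over classes."""
    raw = {c: abs(after[c] - before[c]) for c in before}
    total = sum(raw.values())
    return {c: (raw[c] / total if total > 0 else 0.0) for c in raw}

def pattern_function(delta, p):
    """f^(j)(u) = variation of measure u on class j times proportion p^(j)."""
    return {c: delta[c] * p[c] for c in delta}

# Class sizes taken from the toy data set (52 "star" and 276 "square" vertices);
# the measure values before and after insertion are hypothetical.
labels = ["star"] * 52 + ["square"] * 276
p = class_proportions(labels)
delta = normalized_variation(
    before={"star": 4.0, "square": 6.0},
    after={"star": 4.0, "square": 6.5},
)
f = pattern_function(delta, p)
```

In this example the insertion leaves the "star" measure untouched, so its `f` value is zero, which signals full compliance with that class, whereas all of the variation is attributed to the "square" component.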

Composition of the High Level Classifier. The network measurements that compose the high
level classifier are the assortativity [12], the clustering coefficient, and the average
degree. These three measures have been chosen for the following reasons: the average degree
captures strictly local scalar information about each vertex in the network; the clustering
coefficient of each vertex captures local structures by counting the triangles formed by the
current vertex and any two of its neighbors; the assortativity coefficient considers not only
the current vertex and its neighbors, but also the second level of neighbors (neighbors of
neighbors), the third level of neighbors, and so on. The three measures thus characterize the
network's topological properties at scales ranging from local to global. In this way, the
combination of these measures is expected to capture the pattern formation of the underlying
network in a systematic manner. Details regarding these three measurements are given in the SI.
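
For concreteness, the three measures can be computed on a small undirected graph as sketched below. The exact definitions used in the paper are given in the SI; the versions here (mean degree, mean local clustering coefficient, and degree assortativity as the Pearson correlation of the degrees at both ends of each edge) are the standard textbook forms and are an assumption, as is the adjacency-dictionary input format.

```python
from itertools import combinations

def average_degree(adj):
    """Mean number of neighbors per vertex."""
    return sum(len(nb) for nb in adj.values()) / len(adj)

def average_clustering(adj):
    """Mean over vertices of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            continue  # vertices with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def degree_assortativity(adj):
    """Pearson correlation between the degrees at the two ends of each edge.
    Each edge is counted in both directions, so both ends share one mean."""
    xs, ys = [], []
    for v, nb in adj.items():
        for w in nb:
            xs.append(len(adj[v]))
            ys.append(len(adj[w]))
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0  # undefined for regular graphs; 0.0 is a convention used here
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    return cov / var

# A triangle (0, 1, 2) with a pendant vertex 3, and a star centered at vertex 0.
triangle_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
```

Evaluating each measure on a class component before and after a test vertex is attached yields the per-measure variations on which the high level classifier relies.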

Results and Discussion. — First, the methodology 
is applied to an artificial database in order to better un- 
derstand its functionality. Afterwards, the WSD problem 
is analyzed. The discussion of the observed results is given 
below. 

High Level Applied to a Toy Database. As an introductory example, consider the toy data set
depicted in Fig. 2, where there are two classes: the red or "star" (52 vertices) and the green
or "square" (276 vertices) classes. This example illustrates how the hybrid classifier draws
its decisions. In the training and classification phases, we employ $k = 3$ and
$\epsilon = 0.04$ for the network construction. The fuzzy SVM [13] with RBF kernel ($C = 70$
and $\gamma = 2^{-1}$) is adopted as the low level classifier. By inspection of the figure,
the red or "star" class displays a well-defined pattern, a grid or lattice, whereas the green
or "square" class does not indicate any well-established pattern. Here, the goal is to
classify the cross-shaped data items (test set) one by one using only the information of the
training set. Figures 2a, 2b, and 2c exhibit the decision boundaries of the two classes when
$\lambda = 0$, $\lambda = 0.5$, and $\lambda = 0.8$, respectively. When $\lambda = 0$, only
the SVM prediction is used by the hybrid technique. In this case, one can see that the five
data items are not correctly classified. Notice that the decision boundaries are pushed near
the red or "star" class by virtue of the large number of green or "square" items in the
vicinity. When $\lambda = 0.5$, the SVM and the high level classifier predictions are used
with the same intensity. In this situation, the decision boundaries are dragged toward the
green or "square" class because of the strong pattern that the red or "star" class exhibits.
We can think of this phenomenon as a clash between the two decision boundaries: as $\lambda$
increases, the more structured class tends to possess more decision power and, consequently,
is able to reduce the effective area of the competing class. For example, when
$\lambda = 0.8$, the organizational features of the red or "star" class are so salient that
its effective area invades the high density region of the green or "square" class. In these
two latter cases ($\lambda = 0.5$ and $\lambda = 0.8$), the hybrid high level technique
successfully classifies the cross-shaped data items. In summary, the concept of
classification is altered depending on the value of the compliance term. A small compliance
term causes the final decision of the hybrid classifier to be rooted in the traditional
assumptions of low level classifiers. When a large compliance term is used, the
characteristic that the hybrid classifier emphasizes is the patterns that the classes
display. The stronger the structural pattern of a class, the wider the delineated decision
boundary for that class.

High Level Applied to Word Sense Disambiguation. The efficiency of the high level classifier
is also verified in a real-world application. In this case, we aim at discriminating senses of
ambiguous words (i.e., words with the same lexical form but with different senses)³. Using the
database presented in Ref. [14], two approaches for characterizing senses are employed: the
topological and the semantic approach. In the former, each occurrence of a word is
characterized by its local structure in the word adjacency network [15]. In the latter, each
word sense is represented by the frequency of the $w$ nearby words. Details of these two
methodologies are given in the SI.
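
To make the semantic characterization more tangible, the sketch below builds one frequency vector per occurrence of an ambiguous word from the words around it. The precise procedure is described in the SI and in Ref. [14]; the symmetric window of `w` tokens on each side, the fixed vocabulary, and the function name are assumptions made for illustration.

```python
from collections import Counter

def context_features(tokens, target, w, vocabulary):
    """Return one count vector per occurrence of `target`, counting how often
    each vocabulary word appears within `w` tokens on either side."""
    vectors = []
    for pos, tok in enumerate(tokens):
        if tok != target:
            continue
        # Window of up to w tokens before and after, excluding the target itself.
        window = tokens[max(0, pos - w):pos] + tokens[pos + 1:pos + 1 + w]
        counts = Counter(window)
        vectors.append([counts[v] for v in vocabulary])
    return vectors

tokens = "the bear ate honey while hunters bear heavy loads".split()
vocab = ["ate", "honey", "hunters", "heavy", "loads"]
features = context_features(tokens, target="bear", w=2, vocabulary=vocab)
```

The two occurrences of "bear" in this toy sentence receive disjoint vectors, `[1, 1, 0, 0, 0]` and `[0, 0, 1, 1, 1]`, which is the kind of separation in the attribute space that a classifier can then exploit.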

Table 1 shows the results obtained for the five ambiguous words in the topological approach.
Similarly, Table 2 depicts the results obtained for the semantic approach. In both cases,
when a suitable value of the parameter $\lambda$ is selected, it is possible to improve the
classification accuracy achieved by the low level classifiers (C4.5, kNN and Naive Bayes).
Moreover, because $\lambda$ is different from zero in most cases, one can infer that there is
a pattern in the data organized in the attribute space.



³ For example, the word "bear" might be related either to a large
mammal of the family Ursidae or to the verb "carry".
Fig. 2: Behavior of the decision boundaries as $\lambda$ varies in the toy data. Decision
boundaries when (a) $\lambda = 0$; (b) $\lambda = 0.5$; and (c) $\lambda = 0.8$.



Interestingly, one can conclude that the structural organization of the words in complex
networks is useful for discriminating senses not only when modeling the relationship of words
in a text, but also when modeling the relationship between words in the attribute space. In
other words, when word senses are analyzed with the complex network framework, patterns emerge
both in the organization of words in the adjacency network (before characterization) and in
the network built in the attribute space. These unveiled patterns, in turn, cannot be properly
discovered by traditional techniques. This reasoning explains the performance boost that
occurred when $\lambda \neq 0$ was employed in the experiments.

Conclusion. — In the current paper, we have applied a novel methodology of supervised data
classification in the field of word sense disambiguation. The hybrid classifier is a
combination of traditional (low level) and pattern-based classifiers. The latter uses a
network to exploit the topological structure of the data in search of patterns. From the
analysis of the experiments, we have found that the inclusion of the high level term was
responsible for improving the classification ability both in artificial and real-world
networks. Specifically, in the latter, the methodology devised in Ref. [14] was improved
because words conveying the same meaning display organizational patterns not only at the
textual level but also in the attribute space. This argument strengthens the view that
networks constructed using words are not totally disorganized. Instead, each set of words
tends to form patterns that uniquely describe it. The hybrid framework attempts precisely to
extract these hidden patterns that are cloaked within the word relationships (edges) in the
network.

Because the hybrid high level technique is totally generic, we intend to use it in other
real-world applications besides word disambiguation. In addition, a methodology for
automatically finding the best value of the compliance term will also be the subject of our
future studies.
Table 1: Structural approach for discriminating senses of ambiguous words. Senses were characterized according to topological
CN measurements [14] and the discrimination of senses was performed with low (kNN, C4.5 and Bayes) and high level classifiers.
Note that the high level technique always outperforms the traditional low level classification.
Table 2: Semantic approach for discriminating senses of ambiguous words. Senses were characterized according to the frequency
of the $w = 5$ neighbors of the ambiguous word [14] and the discrimination of senses was performed with low (kNN, C4.5 and
Bayes) and high level classifiers. Acc. Rate represents the accuracy rate obtained with an evaluation based on the 10-fold
cross-validation technique [16]. The p-value refers to the likelihood of obtaining the same accuracy rate with a random classifier
(see Ref. [14] for details). Note that the high level technique always outperforms the traditional low level classification.
Word      Classifier   Low level                   High level                  λ
                       Acc. Rate   p-value         Acc. Rate   p-value
…         …            …           …               …           1.7 x 10^-1     0.35
present   kNN          62.9 %      4.2 x 10^-1     69.4 %      7.9 x 10^-3     0.20
present   C4.5         57.0 %      9.6 x 10^-1     61.4 %      6.3 x 10^-1     0.15
present   Bayes        60.2 %      7.6 x 10^-1     65.3 %      1.7 x 10^-1     0.20
jam       kNN          76.5 %      6.3 x 10^-1     87.3 %      4.6 x 10^-2     0.35
jam       C4.5         76.5 %      6.3 x 10^-1     89.5 %      4.4 x 10^-2     0.40
jam       Bayes        82.4 %      4.1 x 10^-1     90.1 %      4.1 x 10^-2     0.30




